Lab 1

Eric Miao, Sabrina Peng, Fernando Vazquez

CS7324, Fall 2020

1. Business Understanding

1.1. Dataset Introduction

Financial institutions offer a variety of services, including loan approval and disbursement to provide customers with liquid capital for major purchases. Major purchases can include anything from real estate to business PP&E to vehicles. The loan approval process is one that many customers worry about, as the requested loan may or may not be approved (in part or in full) by the institution. Banks use customers' previous and current financial data, as well as human judgment (in the form of loan officer recommendations), to decide whether or not to approve a loan for a customer. Financial data crucial to these decisions includes the amount and necessity of the loan, the customer's credit history, and the customer's current financial situation. According to Investopedia's description of loan grading, bigger banks tend to use more quantitative approaches to approximate credit risk, while smaller banks tend to use a more holistic, human-based judgment system.

This dataset, retrieved from Kaggle, consists of vehicle loan data from Larsen & Toubro, a global conglomerate with a financial services arm, founded in Mumbai, India. The task is to predict whether or not a customer will default on the loan given demographic, current loan, and past financial performance data - a binary classification task. To do so, we perform exploratory data analysis via data visualization in order to discover which features are most important to the determination of potential loan default (loan quality and credit risk).

1.2. Dataset Description

The dataset includes many features, which can be categorized into the following sections:

  • Customer demographic information, including employment type and age.
  • Customer current vehicle loan information, including the asset's cost, the disbursed loan amount, and the loan-to-value ratio. This represents information about the loan in question.
  • Customer past financial performance information, including the average account history, the credit history length, the number of opened / delinquent accounts in the last six months, and data about previous primary and secondary loans.
  • Customer information not relevant to loan determination.

One of the goals of exploring this dataset is to see what demographic, current loan, and past financial factors play a role in determining probability of loan default.

1.3. Dataset Significance

It is important to be able to predict potential loan default because financial institutions want to not only manage their lending capacity effectively, but also reduce occurrences of:

  1. False positives - classifying a customer as someone who will default on a loan when they end up not defaulting. This represents missed business, as the bank did not extend loans to and earn interest from customers who ended up being trustworthy.
  2. False negatives - classifying a customer as someone who will not default on a loan when they end up defaulting. This represents a loss to the business, as the bank will either be forced to charge expected loan losses against earnings or go through additional steps to seize the vehicle or any other pertinent valuable assets.

Both of these errors should be avoided in order for banks to remain financially successful in their loan segments.
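The two error types above can be counted directly by comparing predicted and actual labels. A minimal sketch, using made-up labels for illustration only (1 = loan default):

```python
# Toy labels for illustration only (1 = loan default, 0 = no default)
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]

# False positive: predicted default, but the customer did not default (missed business)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
# False negative: predicted no default, but the customer defaulted (realized loss)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
print(fp, fn)  # 2 false positives, 1 false negative in this toy example
```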

Once models are built to accomplish this classification task, the prediction algorithm would need to outperform the current data- and human-judgment-based approval process, which, according to this dataset, is around 79% correct in approving loans for people who will not default on them. Of course, the higher the accuracy that can be achieved, the better.
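As a sanity check on that figure, the accuracy of a trivial "never defaults" predictor is simply one minus the default rate; a sketch using this dataset's approximate default rate:

```python
# ~21.7% of loans in this dataset ended in default (computed in Section 2.1)
default_rate = 0.217

# Always predicting "no default" is correct whenever the customer does not default
baseline_accuracy = 1 - default_rate
print(round(baseline_accuracy * 100, 1))  # ~78.3%, close to the ~79% quoted above
```

Any useful model therefore needs to beat this majority-class baseline, not just score "high" in the abstract.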

1.4. Prediction Task Ethical Implications

Important to note is that there are a few ethical implications of having an algorithm or model decide loan quality, as explained by this article by White & Case LLP ("Algorithms and bias: What lenders need to know"). An algorithmic credit-scoring system helps financial institutions "optimize default and prepayment rates" and "streamlines the [loan] application process." This machine learning-driven approach seeks to remedy the flaws of a human judgment-based system, in which bankers making the approval decisions may have preconceived and subjective biases. However, algorithmic systems are not immune from bias - in fact, they may "inadvertently generate biased conclusions that discriminate against protected classes of people," resulting in certain classes of people automatically being restricted from loan approval even if they are in other ways good candidates for loan approval. Since an algorithmic approach to credit approval is often more of a black box, financial institutions should be wary of any unintended and discriminatory biases that the model might learn from the data, as this may lead to fair lending claims and reputational damage.

An example of a bias in an algorithmic system is the following, taken from the White & Case LLP article linked above:

Consider an algorithm programmed to examine and incorporate certain shopping patterns into its decision model. It may reject all loan applicants who shop primarily at a particular chain of grocery stores because an algorithm "learned" that shopping at those stores is correlated with a higher risk of default. But if those stores are disproportionately located in minority communities, the algorithm could have an adverse effect on minority applicants who are otherwise creditworthy.

Financial services firms must plan to test for and remove any bias that shows up in their algorithmic systems in order to avoid legal and reputational risk.
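One concrete check from this space is the "four-fifths rule" used in disparate-impact analysis: compare approval (or predicted non-default) rates across groups and flag ratios below 0.8. The group names and outcomes below are hypothetical, for illustration only:

```python
# Hypothetical approval outcomes per group (1 = approved), for illustration only
approvals = {
    "group_a": [1, 1, 0, 1, 1],
    "group_b": [1, 0, 0, 1, 0],
}

# Approval rate for each group
rates = {group: sum(vals) / len(vals) for group, vals in approvals.items()}

# Four-fifths rule: the lowest group rate should be at least 80% of the highest
impact_ratio = min(rates.values()) / max(rates.values())
print(rates, round(impact_ratio, 2))  # a ratio of 0.5 would warrant a closer look
```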

2. Data Understanding

2.1. Data Description

In [1]:
import pandas as pd
import numpy as np

# load the dataset and view first few rows of training dataset
df = pd.read_csv('train.csv')
df.head()
Out[1]:
UniqueID disbursed_amount asset_cost ltv branch_id supplier_id manufacturer_id Current_pincode_ID Date.of.Birth Employment.Type ... SEC.SANCTIONED.AMOUNT SEC.DISBURSED.AMOUNT PRIMARY.INSTAL.AMT SEC.INSTAL.AMT NEW.ACCTS.IN.LAST.SIX.MONTHS DELINQUENT.ACCTS.IN.LAST.SIX.MONTHS AVERAGE.ACCT.AGE CREDIT.HISTORY.LENGTH NO.OF_INQUIRIES loan_default
0 420825 50578 58400 89.55 67 22807 45 1441 01-01-84 Salaried ... 0 0 0 0 0 0 0yrs 0mon 0yrs 0mon 0 0
1 537409 47145 65550 73.23 67 22807 45 1502 31-07-85 Self employed ... 0 0 1991 0 0 1 1yrs 11mon 1yrs 11mon 0 1
2 417566 53278 61360 89.63 67 22807 45 1497 24-08-85 Self employed ... 0 0 0 0 0 0 0yrs 0mon 0yrs 0mon 0 0
3 624493 57513 66113 88.48 67 22807 45 1501 30-12-93 Self employed ... 0 0 31 0 0 0 0yrs 8mon 1yrs 3mon 1 1
4 539055 52378 60300 88.39 67 22807 45 1495 09-12-77 Self employed ... 0 0 0 0 0 0 0yrs 0mon 0yrs 0mon 1 1

5 rows × 41 columns

In [2]:
# set all column names to lowercase and replace . with _
df = df.rename(str.lower, axis='columns')
df.columns = df.columns.str.replace('.', '_', regex=False)
In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233154 entries, 0 to 233153
Data columns (total 41 columns):
uniqueid                               233154 non-null int64
disbursed_amount                       233154 non-null int64
asset_cost                             233154 non-null int64
ltv                                    233154 non-null float64
branch_id                              233154 non-null int64
supplier_id                            233154 non-null int64
manufacturer_id                        233154 non-null int64
current_pincode_id                     233154 non-null int64
date_of_birth                          233154 non-null object
employment_type                        225493 non-null object
disbursaldate                          233154 non-null object
state_id                               233154 non-null int64
employee_code_id                       233154 non-null int64
mobileno_avl_flag                      233154 non-null int64
aadhar_flag                            233154 non-null int64
pan_flag                               233154 non-null int64
voterid_flag                           233154 non-null int64
driving_flag                           233154 non-null int64
passport_flag                          233154 non-null int64
perform_cns_score                      233154 non-null int64
perform_cns_score_description          233154 non-null object
pri_no_of_accts                        233154 non-null int64
pri_active_accts                       233154 non-null int64
pri_overdue_accts                      233154 non-null int64
pri_current_balance                    233154 non-null int64
pri_sanctioned_amount                  233154 non-null int64
pri_disbursed_amount                   233154 non-null int64
sec_no_of_accts                        233154 non-null int64
sec_active_accts                       233154 non-null int64
sec_overdue_accts                      233154 non-null int64
sec_current_balance                    233154 non-null int64
sec_sanctioned_amount                  233154 non-null int64
sec_disbursed_amount                   233154 non-null int64
primary_instal_amt                     233154 non-null int64
sec_instal_amt                         233154 non-null int64
new_accts_in_last_six_months           233154 non-null int64
delinquent_accts_in_last_six_months    233154 non-null int64
average_acct_age                       233154 non-null object
credit_history_length                  233154 non-null object
no_of_inquiries                        233154 non-null int64
loan_default                           233154 non-null int64
dtypes: float64(1), int64(34), object(6)
memory usage: 72.9+ MB

There are 233,154 auto loan customers in this dataset. The above information about the dataset shows that there is almost no missing data - there are 233,154 non-null values for every column except the employment_type field. Fields that start with pri pertain to primary accounts, which the customer has taken out for their own use. Fields that start with sec pertain to secondary accounts, on which the customer has acted as a co-applicant or guarantor.

Many of the features are of the correct data type, but a few require additional preprocessing (conducted in the next section) in order to be usable. These include all of the date fields, which are currently strings and should be integer types (in terms of # of days / months / years since the date). In addition, customer age and credit score values can be segmented into different categories to produce categorical features.

In [4]:
assert len(df) == len(df.uniqueid.unique())

The cell above ensures that all customers (rows) are unique by checking that the number of rows matches the number of unique IDs in the uniqueid field.

In [5]:
print("Percentage of vehicle loans defaulted on: ", sum(df.loan_default == 1) / len(df) * 100)
Percentage of vehicle loans defaulted on:  21.70711203753742

One thing to look out for when utilizing machine learning models in classification tasks is whether or not the dataset is imbalanced - if the number of instances of one class is very small or large in comparison to other classes. In the case of our binary classification task, about 22% of vehicle loans in the dataset resulted in a loan default. This shows that the dataset is not too drastically imbalanced.
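An equivalent, more idiomatic way to inspect the class balance is value_counts(normalize=True); a sketch with toy labels standing in for df.loan_default:

```python
import pandas as pd

# Toy stand-in for df.loan_default (the real column has ~233k values)
labels = pd.Series([0, 0, 0, 0, 1])

# normalize=True returns the share of each class rather than raw counts
print(labels.value_counts(normalize=True))  # 0 -> 0.8, 1 -> 0.2
```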

2.2. Data Preprocessing

In [6]:
import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
%matplotlib inline

import missingno as mn

mn.matrix(df)
plt.show()
In [7]:
# get percentage of null rows

null_rows_df = df[df.employment_type.isnull()]
print("Number of rows with missing data: ", len(null_rows_df))
print("Percentage of rows with missing data: ", len(null_rows_df) / len(df) * 100)
print("Percentage of null rows where loans were defaulted on: ", 
      len(null_rows_df[null_rows_df.loan_default == 1]) / len(null_rows_df) * 100)
Number of rows with missing data:  7661
Percentage of rows with missing data:  3.2858110948128703
Percentage of null rows where loans were defaulted on:  21.45933951181308
In [8]:
# drop rows with null values in employment type field

df_new = df.copy()
df_new = df_new.dropna()
print("New number of rows: ", len(df_new))
New number of rows:  225493

As mentioned above, the vast majority of data is present and non-null in the dataset. This can be confirmed by the missingno matrix plot, which shows missing data (white gaps) only in the employment_type feature column.

For the rows in which employment_type is null, we drop those rows, as the categorical employment type (either Salaried or Self employed) cannot be imputed with any certainty, and the affected rows comprise only about 3% of the total dataset. In addition, the percentage of affected rows where the loan was defaulted on was about 21%, which is very close to the 22% of total rows where the loan was defaulted on, so dropping them should not skew the class balance.

Therefore, we decided to drop the rows with missing employment type data, instead of attempting to impute employment_type for those instances.
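For completeness, the alternative we rejected - imputing the most frequent employment type - is a one-liner; a sketch with a toy Series standing in for df.employment_type:

```python
import pandas as pd

# Toy stand-in for df.employment_type, with one missing value
emp = pd.Series(["Salaried", "Self employed", None, "Self employed"])

# Fill missing entries with the modal (most frequent) category
filled = emp.fillna(emp.mode()[0])
print(filled.tolist())  # the missing entry becomes "Self employed"
```

The drawback is that this silently asserts the missing customers resemble the majority, which we could not justify here.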

In [9]:
# change date of birth to more consumable age value
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta

# code adapted from https://stackoverflow.com/questions/765797/python-timedelta-in-years 
# to convert birthdate to years since birthdate
def convert_to_years(date):
    datetime_val = pd.to_datetime(date, format='%d-%m-%y')
    if datetime_val > datetime.now():
        return relativedelta(datetime.now(), datetime_val).years + 100
    return relativedelta(datetime.now(), datetime_val).years
In [10]:
df_new["customer_age"] = df_new.date_of_birth.apply(convert_to_years)
In [11]:
# https://stackoverflow.com/questions/4628122/how-to-construct-a-timedelta-object-from-a-simple-string
# convert average account age and credit history length from '_yrs _mon' format 
# to number of months using regex matching

import re
regex = re.compile(r'(?P<year>\d+?)yrs (?P<month>\d+?)mon')

def get_num_months(time_delta_str):
    parts = regex.match(time_delta_str)
    if not parts:
        return None
    parts = parts.groupdict()
    # total months = years * 12 + months
    return int(parts['year']) * 12 + int(parts['month'])
In [12]:
df_new["average_acct_age"] = df_new["average_acct_age"].apply(get_num_months)
df_new["credit_history_length"] = df_new["credit_history_length"].apply(get_num_months)

Some modifications were needed to make the date_of_birth, average_acct_age, and credit_history_length features usable. Date of birth was converted to customer_age (in years), while average account age and credit history length were converted from a custom string format ("_yrs _mon") to an integer representing number of months.

In [13]:
# delete any unnecessary columns

unnecessary_cols = ["uniqueid", "branch_id", "supplier_id", "manufacturer_id", "current_pincode_id",
                    "date_of_birth", "state_id", "employee_code_id", "disbursaldate"]

flags = ["mobileno_avl_flag", "aadhar_flag", "pan_flag", "voterid_flag", "driving_flag", "passport_flag"]
In [14]:
df_new = df_new.drop(columns=(unnecessary_cols + flags))
print("Number of new features (excluding the target variable): ", len(df_new.columns)-1)
df_new.head()
Number of new features (excluding the target variable):  26
Out[14]:
disbursed_amount asset_cost ltv employment_type perform_cns_score perform_cns_score_description pri_no_of_accts pri_active_accts pri_overdue_accts pri_current_balance ... sec_disbursed_amount primary_instal_amt sec_instal_amt new_accts_in_last_six_months delinquent_accts_in_last_six_months average_acct_age credit_history_length no_of_inquiries loan_default customer_age
0 50578 58400 89.55 Salaried 0 No Bureau History Available 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 36
1 47145 65550 73.23 Self employed 598 I-Medium Risk 1 1 1 27600 ... 0 1991 0 0 1 23 23 0 1 35
2 53278 61360 89.63 Self employed 0 No Bureau History Available 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 35
3 57513 66113 88.48 Self employed 305 L-Very High Risk 3 0 0 0 ... 0 31 0 0 0 8 15 1 1 26
4 52378 60300 88.39 Self employed 0 No Bureau History Available 0 0 0 0 ... 0 0 0 0 0 0 0 1 1 42

5 rows × 27 columns

We removed columns / features whose data was not useful to loan default determination, including categorical flags that indicated whether or not the customer disclosed information such as mobile number, voter ID, and driver's license numbers. Disbursal date would normally be useful, but we don't know the exact date this data was collected, so it is difficult to extract useful information (e.g. months_since_disbursal).

In [15]:
df_new["age_range"] = pd.cut(x=df_new["customer_age"], 
                             bins=[0, 35, 50, 65, 1e6],
                             labels=['Young Career','Middle Career','Late Career','Retirement Age'])
df_new.age_range.describe()
Out[15]:
count           225493
unique               4
top       Young Career
freq            120974
Name: age_range, dtype: object
In [16]:
df_new["credit_score_category"] = pd.cut(x=df_new["perform_cns_score"], 
                                         bins=[0, 450, 600, 750, 950],
                                         labels=['Bad Credit','Poor Credit','Good Credit','Great Credit'])
df_new.credit_score_category.describe()
Out[16]:
count          113720
unique              4
top       Good Credit
freq            49131
Name: credit_score_category, dtype: object

Above, we took two numeric features, customer age and credit bureau score, and extracted four categories for each feature to produce the age_range and credit_score_category categorical features. Simplifying age and credit scores down to four descriptions allows us to use crosstab operations in our data visualizations to see which subgroups have larger levels of default. Credit score breakdowns are based on this Credit Karma article.

This leaves us with four categorical features - employment_type, perform_cns_score_description, age_range, and credit_score_category - as well as 24 numerical features. Not all 28 feature columns will be used in the visualizations and analysis below, but we are keeping them in the dataframe in case of future use.

In [17]:
print("Final Look at Preprocessed Dataset")
df_new.head()
Final Look at Preprocessed Dataset
Out[17]:
disbursed_amount asset_cost ltv employment_type perform_cns_score perform_cns_score_description pri_no_of_accts pri_active_accts pri_overdue_accts pri_current_balance ... sec_instal_amt new_accts_in_last_six_months delinquent_accts_in_last_six_months average_acct_age credit_history_length no_of_inquiries loan_default customer_age age_range credit_score_category
0 50578 58400 89.55 Salaried 0 No Bureau History Available 0 0 0 0 ... 0 0 0 0 0 0 0 36 Middle Career NaN
1 47145 65550 73.23 Self employed 598 I-Medium Risk 1 1 1 27600 ... 0 0 1 23 23 0 1 35 Young Career Poor Credit
2 53278 61360 89.63 Self employed 0 No Bureau History Available 0 0 0 0 ... 0 0 0 0 0 0 0 35 Young Career NaN
3 57513 66113 88.48 Self employed 305 L-Very High Risk 3 0 0 0 ... 0 0 0 8 15 1 1 26 Young Career Bad Credit
4 52378 60300 88.39 Self employed 0 No Bureau History Available 0 0 0 0 ... 0 0 0 0 0 1 1 42 Middle Career NaN

5 rows × 29 columns

3. Data Visualizations

3.1. Basic Data Visualizations

In [18]:
primary_vars_to_use = ['pri_no_of_accts', 'pri_active_accts', 'pri_overdue_accts', 
                       'pri_current_balance', 'pri_sanctioned_amount',
                       'pri_disbursed_amount', 'primary_instal_amt']
                       
secondary_vars_to_use = ['sec_no_of_accts', 'sec_active_accts', 'sec_overdue_accts', 
                         'sec_current_balance', 'sec_sanctioned_amount',
                         'sec_disbursed_amount', 'sec_instal_amt']

curr_loan_vars_to_use = ['disbursed_amount', 'asset_cost', 'ltv']

past_perf_vars_to_use = ['perform_cns_score', 'new_accts_in_last_six_months',
                         'delinquent_accts_in_last_six_months', 'average_acct_age', 'credit_history_length', 
                         'no_of_inquiries']

customer_info_vars_to_use = ['employment_type', 'customer_age', 'age_range']

For easy access later, we segment the current features that we have into different groups based on what kind of information they contribute - primary loan, secondary loan, current loan, past performance, and customer demographic information. All features are numeric except the ones in customer_info_vars_to_use.

3.1.1. Customer Age and Credit Score Histograms / KDEs

In [19]:
fig = plt.figure(figsize=(20,5))

plt.subplot(1,3,1)
df_new.customer_age.plot.hist(bins=20)
plt.title("Customer Age Histogram")
plt.xlabel("Customer Age (years)")

plt.subplot(1,3,2)
plt.title("Customer Age Kernel Density Estimation")
plt.xlabel("Customer Age (years)")
df_new.customer_age.plot.kde(bw_method=0.2)

plt.subplot(1,3,3)
df_new.customer_age.plot.hist(bins=20)
df_new.customer_age.plot.kde(bw_method=0.1, secondary_y=True)
plt.title("Customer Age Histogram and KDE")
plt.xlabel("Customer Age (years)")
plt.ylim([0, 0.06])

plt.show()

The histogram and kernel density estimation graphs for the customer_age numeric variable show that there are more younger (early-career and mid-career) individuals looking to get loans for vehicle purchases. This brings up the question of whether younger customers have good or bad financial performance metrics, and if they are more likely than older customers to default on loans.

In [20]:
fig = plt.figure(figsize=(20,5))

plt.subplot(1,3,1)
df_new.perform_cns_score.plot.hist(bins=20)
plt.title("Bureau Credit Score Histogram")
plt.xlabel("Bureau Credit Score (points)")

plt.subplot(1,3,2)
df_new.perform_cns_score.plot.kde(bw_method=0.2)
plt.title("Bureau Credit Score Kernel Density Estimation")
plt.xlabel("Bureau Credit Score (points)")

plt.subplot(1,3,3)
df_new.perform_cns_score.plot.hist(bins=20)
df_new.perform_cns_score.plot.kde(bw_method=0.1, secondary_y=True)
plt.title("Bureau Credit Score Histogram and KDE")
plt.xlabel("Bureau Credit Score (points)")
plt.ylim([0, 0.06])

plt.show()
In [21]:
score_risk_levels = ['A-Very Low Risk', 'B-Very Low Risk', 'C-Very Low Risk', 'D-Very Low Risk', 
                     'E-Low Risk', 'F-Low Risk', 'G-Low Risk', 
                     'H-Medium Risk', 'I-Medium Risk', 
                     'J-High Risk', 'K-High Risk',
                     'L-Very High Risk', 'M-Very High Risk']

score_exception_levels = ['No Bureau History Available', 
                          'Not Scored: Not Enough Info available on the customer',
                          'Not Scored: No Activity seen on the customer (Inactive)',
                          'Not Scored: Sufficient History Not Available',
                          'Not Scored: No Updates available in last 36 months',
                          'Not Scored: Only a Guarantor',
                          'Not Scored: More than 50 active Accounts found']

df_defined_risk_levels = df_new[df_new.perform_cns_score_description.isin(score_risk_levels)]
print("Number of instances with defined risk score category: ", len(df_defined_risk_levels))

df_undefined_risk_levels = df_new[df_new.perform_cns_score_description.isin(score_exception_levels)]
print("Number of instances with undefined risk score category: ", len(df_undefined_risk_levels))
print("Undefined risk level credit bureau scores: ", sorted(df_undefined_risk_levels.perform_cns_score.unique()))
Number of instances with defined risk score category:  101240
Number of instances with undefined risk score category:  124253
Undefined risk level credit bureau scores:  [0, 11, 14, 15, 16, 17, 18]

The histogram and kernel density estimation graphs for the perform_cns_score numeric variable show that more than half the customers in the dataset have a credit score of 0, which most likely corresponds to the perform_cns_score_description values of "No Bureau History" or "Not Scored: {reason}." Above, we split the unique score descriptions into defined and undefined risk levels (score_risk_levels and score_exception_levels, respectively). As seen above, undefined risk level credit bureau scores were all less than 20.

This means that in certain future data visualization and analysis concerning credit scores, we may need to remove all instances where customers have undefined risk category descriptions (listed above in score_exception_levels), whether the lack of a score is due to very poor financial performance or to customers who are just starting to build credit.

3.1.2. Customer Age, Employment Type, and Loan-to-Value Plots

In [22]:
import seaborn as sns

df_add_ltv = df_new[['employment_type','customer_age','age_range','ltv']]
sns.barplot(x=df_add_ltv.employment_type, y=df_add_ltv.ltv, hue="age_range",data=df_add_ltv)
plt.title("Measuring LTV ratio against Customer Employment and Age Range")
plt.show()

We want to analyze customer age and employment type in relation to ltv, which stands for the loan-to-value ratio and incorporates information about the current loan in question. Based on the barplot, we can see that career experience may be a determinant when it comes to ltv and how much a bank is willing to grant to an individual seeking a loan. The middle career and late career groups seem to have higher ltv compared to young career and retirement-aged customers, indicating financial stability is something that banks look at in the credit underwriting process.

Next, we want to visualize customer age and employment type in relation to our target variable, the binary variable loan_default.

In [23]:
df_sample = df_new.sample(frac=0.10)
print("Size of subsample: ", len(df_sample))
Size of subsample:  22549

We have a large number of instances, so to speed up the computation of our box, violin, and swarm plots, we decided to randomly sample and use only 10% of the data for the following visualizations. 10% of our dataset is around 22,500 instances, which is still a nontrivial amount of data.
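A plain random sample can drift slightly from the ~21.7% default rate; if that mattered, sampling within each class would preserve it exactly. A sketch, with a toy frame standing in for df_new (DataFrameGroupBy.sample requires pandas >= 1.1):

```python
import pandas as pd

# Toy stand-in for df_new: 80 non-defaults and 20 defaults
toy = pd.DataFrame({"loan_default": [0] * 80 + [1] * 20, "ltv": range(100)})

# Sampling within each class keeps the class ratio of the full data exact
strat = toy.groupby("loan_default", group_keys=False).sample(frac=0.10, random_state=42)
print(len(strat), (strat["loan_default"] == 1).mean())  # 10 rows, 20% defaults
```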

In [24]:
plt.subplots(figsize=(20, 8))

plt.subplot(1,3,1)
sns.boxplot(x="employment_type", y="customer_age", hue="loan_default", data=df_sample)
plt.title('Customer Age vs. Employment Type Boxplot')

plt.subplot(1,3,2)
sns.violinplot(x="employment_type", y="customer_age", hue="loan_default", data=df_sample)
plt.title('Customer Age vs. Employment Type Violinplot')

plt.subplot(1,3,3)
sns.swarmplot(x="employment_type", y="customer_age", hue="loan_default", data=df_sample)
plt.title('Customer Age vs. Employment Type Swarmplot')

plt.show()

Here, we plot a boxplot, violinplot, and swarmplot of customer_age grouped by employment_type. We chose the boxplot to examine the overall distribution of the customer age variable, and added the violin and swarm plots to check for unusual modality or clusters, since boxplots alone can be misleading - many different distributions share the same boxplot, so underlying features of the distribution may be missed.

By comparing these two variables and looking at the instances of loan default in this context:

  • The boxplot shows that the median individual who defaults on a loan (regardless of employment type) is younger than one who does not default on a loan.
  • The violinplot shows that self-employed customers applying for vehicle loans are generally older. This makes sense because self-employment usually comes later on in life, and requires more time to pass in order to establish financial stability.
  • The swarmplot does not yield any significant clustering or distributional information, as loan defaults happen at every age and in both employment types.

3.1.3. Important Feature Correlation Matrix

In [25]:
vars_to_use = ['loan_default', 'customer_age', 'employment_type', 'perform_cns_score',
               'disbursed_amount', 'asset_cost', 'ltv',
               'pri_no_of_accts', 'pri_active_accts', 'pri_overdue_accts', 'pri_current_balance',
               'new_accts_in_last_six_months', 'delinquent_accts_in_last_six_months',
               'no_of_inquiries']

df_important_cols = df_new[vars_to_use]
In [26]:
import seaborn as sns

# sns.set applies the style (it returns None, so it cannot be used as a colormap)
sns.set(style="darkgrid")
f, ax = plt.subplots(figsize=(15, 10))
sns.heatmap(df_important_cols.corr(), annot=True)
plt.title("Selected Feature Correlation Matrix")
plt.show()

Here, we've visualized a correlation matrix between a subset of the dataset features that we deemed most important to determination of potential loan default (based on description of the features in the dataset's included metadata). There are some highly positive correlations that are expected - for example, the disbursed amount for the loan is going to be greater as the vehicle asset cost increases, since customers will likely request a greater loan, and the number of primary active accounts is directly related to the number of accounts opened in the last six months.

Of particular interest here are the following observations:

  • Loan default is not strongly correlated (either positively or negatively) with any single feature - this is most likely because the loan_default target is binary, so its Pearson correlations with continuous features tend to be small in magnitude.
  • The number of delinquent accounts (loans defaulted) in the last six months in the customer's name is pretty highly correlated with the primary number of accounts, the primary number of active accounts, and the primary number of overdue accounts. This may indicate that the more accounts a customer currently has and is struggling with paying for, the more likely the customer will become delinquent. These factors could be indications of future financial performance, as past history will certainly affect the future.
  • Nothing stands out as distinctly negatively correlated besides the loan-to-value ratio and asset cost variables, which is to be expected.
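On the first observation above: Pearson correlation with a 0/1 variable is the point-biserial correlation, which is well defined but usually small in magnitude unless a feature separates the two classes sharply. A toy sketch (numpy assumed; the values are made up):

```python
import numpy as np

# Toy binary target and a feature that separates the two classes well
target = np.array([0, 0, 1, 1, 1, 0])
feature = np.array([10.0, 12.0, 30.0, 28.0, 31.0, 11.0])

# Pearson correlation with a binary variable is the point-biserial correlation
r = np.corrcoef(target, feature)[0, 1]
print(round(r, 2))  # close to 1 here because the classes barely overlap
```

The real features overlap heavily across defaulters and non-defaulters, which is why the loan_default row of the matrix stays near zero.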

3.2. Dataset-Specific Questions and Visualizations

3.2.1. Crosstab Subgrouping

Questions: Does age range have an effect on loan default? Does employment type have an effect on loan default? Does credit score description / range have an effect on loan default?

In [27]:
age_to_default_percentage = pd.crosstab([df_new['age_range']], df_new.loan_default.astype(bool), normalize='index')
print(age_to_default_percentage)

age_to_default_percentage.plot(kind='barh', stacked=True)
plt.title("Percentage of Loan Defaults based on Age Range Subgroups")
plt.xlabel("Percentage of No Loan Default vs. Loan Default in Subgroup")
plt.show()
loan_default       False     True 
age_range                         
Young Career    0.770364  0.229636
Middle Career   0.794992  0.205008
Late Career     0.805416  0.194584
Retirement Age  0.864662  0.135338

This crosstab operation, which groups customer instances by age_range, shows that loan default becomes less likely as customer age increases. Young career individuals have the greatest percentage of defaults, whereas retirement age individuals have the lowest percentage. This is in line with a general assumption that older customers have more financial stability due to accumulation of wealth and financial knowledge over their careers.

In [28]:
emp_default = pd.crosstab([df_defined_risk_levels['employment_type']], 
                                 df_defined_risk_levels.loan_default.astype(bool),
                                 normalize='index')
print(emp_default)

emp_default.plot(kind='barh', stacked=True)
plt.title("Percentage of Loan Defaults based on Employment Type Subgroups")
plt.xlabel("Percentage of No Loan Default vs. Loan Default in Subgroup")
plt.show()
loan_default        False     True 
employment_type                    
Salaried         0.815937  0.184063
Self employed    0.785466  0.214534

This crosstab operation, which groups customer instances by employment_type, shows that salaried individuals are slightly less likely to default on loans than self-employed individuals. This is in line with the general assumption that salaried customers have a more stable income stream, and thus more financial stability, than self-employed customers.

In [29]:
score_default = pd.crosstab([df_new['perform_cns_score_description']], 
                                 df_new.loan_default.astype(bool),
                                 normalize='index')

score_default.plot(kind='barh', stacked=True, figsize=(8,6))
plt.title("Percentage of Loan Defaults based on Credit Score Category Subgroups")
plt.xlabel("Percentage of No Loan Default vs. Loan Default in Subgroup")
plt.show()

The graph shown above visualizes the percentage of customers in each original bureau score subgroup who do and do not default on their vehicle loans. As one might expect, loan_default is generally correlated with risk category level: the highest percentage of loan defaults came from customers classified as M-Very High Risk, and the lowest from customers classified as B-Very Low Risk. Moving up the risk categories from A to M (Very Low Risk to Very High Risk), there is a gradual increase in the percentage of default. An interesting observation is that customers classified as I-Medium Risk had default percentages similar to K-High Risk and L-Very High Risk customers, at around 27%; this may indicate that defaulting customers at this bureau score level will eventually be shifted to a higher risk category once the default is reflected in their future scores.

In [30]:
score_default = pd.crosstab([df_new['age_range'], df_new['credit_score_category']],
                             df_new.loan_default.astype(bool),
                             normalize='index')

score_default.plot(kind='barh', stacked=True, figsize=(8,6))
plt.title("Percentage of Loan Defaults based on Credit Score and Age Range Subgroups")
plt.xlabel("Percentage of No Loan Default vs. Loan Default in Subgroup")
plt.show()

To finalize our analysis for this question, we created a plot that forms a subgroup for each credit score category and age range pair and calculated the percentage of loan default in each subgroup. Not surprisingly, the subgroups with bad or poor credit had more loan default instances than subgroups with good or great credit. We have seen from previous analyses that age range plays a role in loan default, but this chart shows that past financial performance (in the form of a credit score) plays an even greater one.

3.2.2. Scatter Plots

Question: What "other loan" information (primary and secondary loan features) affects a customer's probability of defaulting on a loan? In other words, how indicative are past financial decisions of current and future financial performance?

In [31]:
df_primary_jitter = df_new[primary_vars_to_use].copy()
df_primary_jitter['loan_default'] = df_new.loan_default
sns.pairplot(df_primary_jitter, hue="loan_default", vars=primary_vars_to_use, 
             height=2, plot_kws=dict(s=20, alpha=0.15, linewidth=0))
plt.show()
In [32]:
df_secondary_jitter = df_new[secondary_vars_to_use].copy()
df_secondary_jitter['loan_default'] = df_new.loan_default
sns.pairplot(df_secondary_jitter, hue="loan_default", vars=secondary_vars_to_use,
             height=2, plot_kws=dict(s=20, alpha=0.15, linewidth=0))
plt.show()

Here, we visualize two pair plots of scatter plots, one with past primary loan information and one with past secondary loan information. We wanted to see whether any groups of primary or secondary loan features would be useful for inclusion in principal component analysis and future classification, since these features make up more than half of the dataset and we wanted to reduce its dimensionality.

What we can see is that certain expected linear relationships exist (e.g. between current_balance, sanctioned_amount, and disbursed_amount in both the primary and secondary feature sets). However, there is no clear separation in the plots between instances of loan default and instances of no loan default. This suggests we should use dimensionality reduction (such as PCA) or clustering algorithms to more easily find "groups" of loan customers, whose feature characteristics banks could then use to decide whether to approve loans.
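A clustering pass of this kind could be sketched with scikit-learn's KMeans. The feature matrix below is a synthetic stand-in for df_new[primary_vars_to_use], and the cluster count is an illustrative assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the primary-loan feature matrix.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))

# Standardize first so no single large-scale feature dominates the distances.
X_scaled = StandardScaler().fit_transform(X_demo)

# Group customers into a handful of clusters; each cluster's centroid profile
# could then be compared against its observed default rate.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_[:10])
```

In practice, the number of clusters would be tuned (e.g. with an elbow or silhouette analysis) rather than fixed at 4.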

The most interesting observation from these plots is that "no default" customers tend to have larger values for most of the features. Interestingly, some customers have very large current balances and numbers of overdue accounts, yet still manage not to default on their vehicle loans. This could indicate that they are prioritizing this loan (since it is a large purchase), or simply that they had not yet defaulted at the time the data was gathered. Normally, a greater number of loans would mean increased credit risk, as these customers are more leveraged, but it may also mean that these customers are more cognizant of their financial situation and are working hard to avoid default.

3.2.3 Principal Component Analysis

Question: What weighted combination of numeric features best describes the given data? In other words, how can we reduce the number of features we look at in our model and construct new ones that might be useful in our binary classification task?

Sub-question: There is a lot of primary and secondary loan information which inflates the number of features. Are all of these necessary? Why or why not?

In [33]:
import plotly.express as px
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

pca_chosen_features = list(df_new.select_dtypes(include=['number']))
pca_chosen_features.remove('loan_default')

df_sklearn = df_new[pca_chosen_features]
df_y = df_new['loan_default']
In [34]:
# plot explained variance

pca = PCA()
pca.fit(df_sklearn)
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

px.area(
    x=range(1, exp_var_cumul.shape[0] + 1),
    y=exp_var_cumul,
    labels={"x": "# Components", "y": "Explained Variance"}
)
In [35]:
# pca code adapted from https://plotly.com/python/pca-visualization/
# visualizing a subset of the principal components that explains almost 99% of the total variance

n_components = 2
pca = PCA(n_components=n_components)
components = pca.fit_transform(df_sklearn)

total_var = pca.explained_variance_ratio_.sum() * 100

labels = {str(i): f"PC {i+1}" for i in range(n_components)}
labels['color'] = 'loan_default'

fig = px.scatter_matrix(
    components,
    color=df_y,
    dimensions=range(n_components),
    labels=labels,
    title=f'Total Explained Variance: {total_var:.2f}%',
)
fig.update_traces(diagonal_visible=False)

As a segue into the next section, where we use UMAP for non-linear dimensionality reduction, we utilize the PCA implementation from sklearn and plotting functions from plotly to answer our question of what weighted combination of numeric features best describes our data.

We first take all of the numeric features from the original dataset (24 features, a majority of the total) and run PCA with default parameters to determine how many components to specify. The cumulative explained variance ratio graph shows that almost 99% of the variation in the data can be represented by only two principal components, so we set n_components to 2.
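One caveat worth noting: when PCA is run on unstandardized data, the explained variance tends to be dominated by the largest-magnitude columns, which can make very few components appear to capture nearly all the variance. A minimal sketch with synthetic stand-in values (the monetary and count scales are assumptions, not the dataset's actual columns):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two features on wildly different scales, e.g. a rupee-denominated loan
# amount vs. a small count-like feature; values here are synthetic.
rng = np.random.default_rng(1)
X_demo = np.column_stack([
    rng.normal(500_000, 100_000, size=1000),  # large-scale monetary feature
    rng.normal(5, 2, size=1000),              # small-scale count-like feature
])

# Unscaled: the first component absorbs nearly all variance by scale alone.
ratio_raw = PCA().fit(X_demo).explained_variance_ratio_
print(ratio_raw)

# Standardized: variance splits roughly evenly between the two components.
X_scaled = StandardScaler().fit_transform(X_demo)
ratio_scaled = PCA().fit(X_scaled).explained_variance_ratio_
print(ratio_scaled)
```

Running StandardScaler on df_sklearn before fitting PCA would show whether the ~99% figure reflects real structure or merely differing feature scales.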

Our results are less than satisfactory: there are no clear clusters separating individuals who default on vehicle loans from those who do not. This could be because the data is not well described by linear combinations of its features. In that case, kernel PCA may be a better approach to finding principal components [https://www.cs.cmu.edu/~elaw/papers/pca.pdf].
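A kernel PCA attempt could be sketched as follows. The RBF kernel and gamma value are illustrative assumptions, and X_demo is a synthetic stand-in for the (ideally standardized) numeric feature matrix df_sklearn:

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized numeric feature matrix.
rng = np.random.default_rng(7)
X_demo = StandardScaler().fit_transform(rng.normal(size=(300, 6)))

# RBF-kernel PCA implicitly maps the data through a non-linear kernel
# before extracting components; gamma is a tunable assumption here.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
components = kpca.fit_transform(X_demo)
print(components.shape)
```

Note that kernel PCA is memory-intensive (it builds an n-by-n kernel matrix), so on this dataset it would likely be run on a sample, as we do for UMAP below.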

4. Dimensionality Reduction with UMAP

In [42]:
# code adapted from https://umap-learn.readthedocs.io/en/latest/basic_usage.html

import plotly.express as px
from sklearn.decomposition import PCA
import umap.umap_ as umap

# separate data and target
df_umap = df_new.copy()
df_umap = df_umap[pca_chosen_features]
df_umap["loan_default"] = df_new.loan_default
df_umap = df_umap.sample(frac=0.10)

X = df_umap.loc[:, df_umap.columns != "loan_default"]
y = df_umap["loan_default"]
In [43]:
# declare reducer and call fit / transform to get embeddings

reducer = umap.UMAP(random_state=42, n_neighbors=10, min_dist=0.1)
reducer.fit(X)
embedding = reducer.transform(X)
print(embedding.shape)
(22549, 2)
In [44]:
# plot umap results

plt.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap='Spectral', s=5)
plt.gca().set_aspect('equal', 'datalim')
plt.colorbar(boundaries=np.arange(3)-0.5).set_ticks(np.arange(2))
plt.title('UMAP projection of Vehicle Loan Dataset', fontsize=12);

UMAP stands for Uniform Manifold Approximation and Projection. It is a dimension reduction technique that can be used for visualization but also for general non-linear dimension reduction. The UMAP algorithm can be described as two different phases:

  • The first phase consists of constructing a fuzzy topological representation.
  • The second phase is simply optimizing the low dimensional representation to have as close a fuzzy topological representation as possible as measured by cross entropy.

The result is a fast dimension reduction algorithm that can handle large, high-dimensional datasets without too much difficulty, scaling beyond what most t-SNE (t-Distributed Stochastic Neighbor Embedding) implementations can manage.

We use the UMAP library to construct and fit a model on all of the numeric features in the dataset, then plot the resulting embeddings. Due to limited computational power, we use only a 10% random sample of the instances. Similar to our PCA analysis, there are no discrete clusters corresponding to the target classes. The radial shape of the resulting UMAP plot is possibly an indication that the relationships among the dataset's features are non-linear.

From our two dimensionality reduction experiments, we can tentatively conclude that linear PCA is not ideal for this dataset. More work with other techniques will be needed to achieve the kind of separation we expected from PCA.

UMAP Citation

@article{mcinnes2018umap-software,
  title={UMAP: Uniform Manifold Approximation and Projection},
  author={McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},
  journal={The Journal of Open Source Software},
  volume={3},
  number={29},
  pages={861},
  year={2018}
}